METHOD AND APPARATUS FOR IMPLEMENTING PARALLEL OPERATIONS
IN A DATABASE MANAGEMENT SYSTEM

This is a continuation of application Ser. No. 08/441,527,
filed May 15, 1995, now abandoned, which is a continuation
of application Ser. No. 08/127,585, filed Sep. 27, 1993.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of parallel processing in
a database environment.

2. Background Art

Sequential query execution uses one processor and one
storage device at a time. Parallel query execution uses
multiple processes to execute in parallel suboperations of a
query. For example, virtually every query execution includes
some form of manipulation of rows in a relation, or table of
the DBMS. Before any manipulation can be done, the rows
must be read, or scanned. In a sequential scan, the table is
scanned using one process.

Parallel query systems provide the ability to break up the
scan such that more than one process can perform the table
scan. Existing parallel query systems are implemented in a
shared nothing, or a shared everything environment. In a
shared nothing environment, each computer system is
comprised of its own resources (e.g., memory, central processing
unit, and disk storage). FIG. 1B illustrates a shared nothing
hardware architecture. The resources provided by system
one are used exclusively by system one. Similarly, system n
uses only those resources included in system n.

Thus, a shared nothing environment is comprised of one
or more autonomous computer systems that process their
own data, and transmit a result to another system. Therefore,
a DBMS implemented in a shared nothing environment has
an automatic partitioning scheme. For example, if a DBMS
has partitioned a table across the one or more of the
autonomous computer systems, then any scan of the table
requires multiple processes to process the scan.

This method of implementing a DBMS in a shared
nothing environment provides one technique for introducing
parallelism into a DBMS environment. However, using the
location of the data as a means for partitioning is limiting.
For example, the type and degree of parallelism must be
determined when the data is initially loaded into the DBMS.
Thus, there is no ability to dynamically adjust the type and
degree of parallelism based on changing factors (e.g., data
load or system resource availability).

Further, using physical partitioning makes it difficult to
mix parallel queries and sequential updates in one
transaction without requiring a two-phase commit. These types of
systems must do two-phase commit because data is located
on multiple disks. That is, transaction and recovery
information is located on multiple disks. A shared disk logical
software architecture avoids a two-phase commit because all
processes can access all disks (see FIG. 1D). Therefore,
recovery information for updates can be written to one disk,
whereas data accesses for read-only accesses can be done
using multiple disks in parallel.

Another hardware architecture, shared everything,
provides the ability for any resource (e.g., central processing
unit, memory, or disk storage) to be available to any other
resource. FIG. 1A illustrates a shared everything hardware
architecture. All of the resources are interconnected,
and any one of the central processing units (i.e., CPU 1 or
CPU n) can use any memory resource (i.e., Memory 1 to
Memory n) or any disk storage (i.e., Disk Storage 1 to Disk
Storage n). However, a shared everything hardware
architecture cannot scale. That is, a shared everything hardware
architecture is feasible when the number of processors is
kept at a minimal number of twenty to thirty processors. As
the number of processors increases (e.g., above thirty), the
performance of the shared everything architecture is limited
by the shared bus (e.g., bus 102 in FIG. 1A) between
processors and memory. This bus has limited bandwidth and
the current state of the art of shared everything systems does
not provide for a means of increasing the bandwidth of the
shared bus as more processors and memory are added. Thus,
only a fixed number of processors and memory can be
supported in a shared everything architecture.

SUMMARY OF THE INVENTION

The present invention implements parallel processing in a
Database Management System. The present invention does
not rely on physical partitioning to determine the degree of
parallelism. Further, the present invention does not need to
use read locks, or require a two-phase commit in transaction
processing because transaction and recovery information is
located on multiple disks.

The present invention provides the ability to dynamically
partition row sources for parallel processing. That is,
partitioning identifies the technique for directing row sources to
one or more query slaves. The present invention does not
rely on static partitioning (i.e., partitioning based on the
storage location of the data).

The present invention can be implemented using any
architecture (i.e., shared nothing, shared disk, and shared
everything). Further, the present invention can be used in a
software-implemented shared disk system (see FIG. 1D). A
software-implemented shared disk system is a shared
nothing hardware architecture combined with a high bandwidth
communications bus (bus 106 in FIG. 1D) and software that
allows blocks of data to be efficiently transmitted between
systems.

A central scheduling mechanism minimizes the resources
needed to execute an SQL operation. Further, a hardware
architecture where processors do not directly share disk
architecture can be programmed to appear as a logically
shared disk architecture to other, higher levels of software
via mechanisms of passing disk input/output requests
indirectly from processor to processor over high bandwidth
shared nothing networks.

At compilation time, a sequential query execution plan is
generated. Then, the execution plan is examined, from the
bottom up, to determine those portions of the plan that can
be parallelized. Parallelism is based on the ability to
parallelize a row source. Further, the partitioning requirements of
consecutive row sources and the partitioning requirements
of the entire row source tree are examined. Further, the
present invention provides the ability for the SQL statement
to specify the use and degree of parallelism.

A Query Coordinator (QC) process assumes control of the
processing of a query. The QC can also execute row sources
that are to be executed serially. Additional threads of control
are associated with the QC for the duration of the parallel
execution of a query. Each of these threads is called a Query
Server (QS). Each QS executes a parallel operator and
processes a subset of intermediate or output data. The
parallel operators that are executed by a QS are called data
flow operators (DFOs).



A DFO is represented as an extended structured query
language (SQL) statement. A DFO is a representation of one
row source or a tree of row sources suitable for parallel
execution. A DFO SQL statement can be executed concurrently
by multiple processes, or query slaves. DFOs introduce
parallelism into SQL operations such as table scan, order by,
group by, joins, distinct, aggregate, unions, intersect, and
minus. A DFO can be one or more of these operations.

A central scheduling mechanism, a data flow scheduler, is
allocated at compile time. When the top (i.e., root) of a row
source tree, or a portion of a serial row source tree is
encountered that cannot be implemented in parallel, the
portion of the tree below this is allocated for parallelism. A
data flow scheduler row source is allocated at compilation
time and is executed by the QC process. It is placed between
the serial row source and the parallelizable row sources
below the serial row source. Every data flow scheduler row
source and the parallelizable row sources below it comprise
a DFO tree. A DFO tree is a proper subtree of the row source
tree. A row source tree can contain multiple DFO trees.

If, at execution, the row source tree is implemented using
parallelism, the parallelizer row source can implement the
parallel processing of the DFOs in the row source tree for
which it is the root node. If the row source tree is
implemented serially, the parallelizer row source becomes
invisible. That is, the rows produced by the row sources in the
DFO tree merely pass through the parallelizer to the row
sources above them in the row source tree.

The present invention uses table queues to partition and
transport rows between sets of processes. A table queue (TQ)
encapsulates the data flow and partitioning functions. A TQ
partitions its input to its output according to the needs of the
parent DFO and/or the needs of the entire row source tree.
The table queue row source synchronously dequeues rows
from a table queue. A TQ connects the set of producer slaves
on its input to the set of consumer slaves on its output.

During the compilation and optimization process, each
node in the row source tree is annotated with parallel data
flow information. Linkages between nodes in a row source
tree provide the ability to divide the nodes into multiple lists.
Each list can be executed by the same set of query slaves.

In the present invention only those processes that are not
dependent on another's input (i.e., leaf nodes), and those
slaves that must be executing to receive data from these
processes execute concurrently. This technique of invoking
only those slaves that are producing or consuming rows
provides the ability to minimize the number of query slaves
needed to implement parallelism.

The present invention includes additional row sources to
facilitate the implementation of the parallelism. These
include table queue, table access by partition, and index
creation row sources. An index creation row source
assembles sub-indices from underlying row sources. The
sub-indices are serially merged into a single index. Row
sources for table and index scanning, table queues, and
remote tables have no underlying row sources, since they
read rows directly from the database, a table queue, or a
remote data store.

A table queue row source is a mechanism for partitioning
and transporting rows between sets of processes. The
partitioning function of a table queue row source is determined
by the partitioning type of the parent DFO.

The present invention provides the ability to eliminate
needless production of rows (i.e., the sorcerer's apprentice
problem). In some cases, an operation is dependent on the
input from two or more operations. If the result of any input
operation does not produce any rows for a given consumer
of that operation, then the subsequent input operation must
not produce any rows for that consumer. If a subsequent
input operation were to produce rows for a consumer that did
not expect rows, the input would behave erroneously, as a
"sorcerer's apprentice."

The present invention uses a bit vector to monitor whether
each consumer process received any rows from any
producer slaves. Each consumer is represented by a bit in the bit
vector. When all of the end of fetch (i.e., eof) messages are
received from the producers of a consumer slave, the
consumer sends a done message to a central scheduling
mechanism (i.e., a data flow scheduler). The data flow scheduler
determines whether the consumer slave received any rows,
and sets the consumer's bit accordingly. The bit in the bit
vector is used by subsequent producers to determine whether
any rows need to be produced for any of its consumers. The
bit vector is reset at the beginning of each level of the tree.

The data flow scheduler uses states and a count of the
slaves that have reached these states to perform its
scheduling tasks. As the slaves asynchronously perform the tasks
transmitted to them by the data flow scheduler, they transmit
state messages to the data flow scheduler indicating the
stages they reach in these tasks. The data flow scheduler
keeps track of the states of two DFOs at a time (i.e., the
current DFO and the parent of the current DFO). A "started"
state indicates that a slave is started and able to consume
rows. A "ready" state indicates that a slave is processing
rows and is about to produce rows. A "partial" state indicates
that a slave is finished scanning a range of rowids, or
equivalently, scanning a range of a file or files that contains
rows, and needs another range of rowids to scan additional
rows. "Done" indicates that a slave is finished processing.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1D illustrate shared everything, shared
nothing, and shared disk environments.

FIG. 2 provides an example of database tables and a
Structured Query Language query.

FIG. 3A illustrates an example of a serial row source tree.

FIG. 3B illustrates a parallelized row source tree.

FIG. 3C illustrates a row source tree divided into levels
each of which is implemented by a set of query slaves.

FIG. 4 illustrates table queues.

FIG. 5 illustrates a right-deep row source tree.

FIG. 6A provides an example of parallelism annotation
information.

FIG. 6B provides an example of information sent to query
slaves.

FIGS. 7A-7F illustrate slave DFO process flows.

FIG. 8 illustrates a row source tree including parallelizer
row sources.

FIG. 9 illustrates a three way join.

FIG. 10A provides an Allocate Parallelizer process flow.

FIGS. 10B-10C provide an example of the process flow of
TreeTraversal.

FIG. 11A illustrates a process flow for Fetch.

FIG. 11B provides an example of the process flow of
ProcessRowOutput.

FIGS. 11C-11D illustrate a process flow of
ProcessMsgOutput.

FIG. 12 illustrates a Resume process flow.
FIG. 13 illustrates a process flow for ProcessReadyMsg.

FIG. 14 provides a process flow for NextDFO. 

FIG. 15 illustrates a process flow for Start.

FIG. 16 illustrates a process flow for Close.

FIG. 17 illustrates a process flow for SendCloseMsg. 

FIG. 18 illustrates a StartParallelizer process flow. 

FIG. 19 illustrates a Stop process flow. 

DETAILED DESCRIPTION OF THE 
INVENTION 

A method and apparatus for parallel query processing is 
described. In the following description, numerous specific 
details are set forth in order to provide a more thorough 
description of the present invention. It will be apparent,
however, to one skilled in the art, that the present invention 
may be practiced without these specific details. In other 
instances, well-known features have not been described in 
detail so as not to obscure the invention. 

ROW SOURCES 

Prior to execution of a query, the query is compiled. The 
compilation step decomposes a query into its constituent
parts. In the present invention, the smallest constituent parts
are row sources. A row source is an object-oriented
mechanism for manipulating rows of data in a relational database
system (RDBMS). A row source is implemented as an
iterator. Every row source has class methods associated with
it (e.g., open, fetch next, and close). Examples of row sources
include: count, filter, join, sort, union, and table scan. Other
row sources can be used without exceeding the scope of the 
present invention. 
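
The iterator behavior described above can be sketched as follows. This is a minimal illustration only, not the implementation described in this specification: the class names `TableScan` and `Filter`, the helper `drain`, and the use of in-memory lists as tables are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class TableScan:
    """Leaf row source (hypothetical): reads rows from a list standing in for a table."""

    def __init__(self, table: List[Row]) -> None:
        self._table = table
        self._pos = 0

    def open(self) -> None:
        self._pos = 0

    def fetch_next(self) -> Optional[Row]:
        if self._pos >= len(self._table):
            return None  # end of fetch
        row = self._table[self._pos]
        self._pos += 1
        return row

    def close(self) -> None:
        self._pos = len(self._table)


class Filter:
    """Row source with one underlying row source; passes on rows matching a predicate."""

    def __init__(self, child: Any, predicate: Callable[[Row], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def open(self) -> None:
        self._child.open()

    def fetch_next(self) -> Optional[Row]:
        while True:
            row = self._child.fetch_next()
            if row is None or self._predicate(row):
                return row

    def close(self) -> None:
        self._child.close()


def drain(source: Any) -> List[Row]:
    """Open a row source, fetch until end of fetch, then close it."""
    source.open()
    rows = []
    while (row := source.fetch_next()) is not None:
        rows.append(row)
    source.close()
    return rows
```

Because every row source exposes the same three methods, a row source such as `Filter` can be stacked on any other row source without knowing what lies beneath it.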

As a result of the compilation process, a plan for the 
execution of a query is generated. An execution plan is a 
plan for the execution of an SQL statement. An execution
plan is generated by a query optimizer. A query optimizer 
compiles an SQL statement, identifies possible execution 
plans, and selects an optimal execution plan. One method of 
representing an execution plan is a row source tree. At 
execution, traversal of a row source tree from the bottom up
yields a sequence of steps for performing the operation(s) 
specified by the SQL statement. 

A row source tree is composed of row sources. During the 
compilation process, row sources are allocated, and each 
row source is linked to zero, one, two, or more underlying 
row sources. The makeup of a row source tree depends on 
the query and the decisions made by the query optimizer 
during the compilation process. Typically, a row source tree
is comprised of multiple levels. The lowest level, the leaf
nodes, accesses rows from a database or other data store. The
top row source, the root of the tree, produces, by 
composition, the rows of the query that the tree implements. 
The intermediate levels perform various transformations on 
rows produced by underlying row sources. 

Referring to FIG. 2, SQL statement 216 illustrates a query 
that involves the selection of department name 214 from 
department table 210 and employee name 206 from 
employee table 202 where the department's department number
is equal to the employee's department number. The result is
to be ordered by employee name 206. The result of this
operation will yield the employee name and the name of the
department in which the employee works in order of
employee name.

An optimal plan for execution of SQL statement 216 is 
generated. A row source tree can be used to represent an 
execution plan. FIG. 3A illustrates an example of a row
source tree for this query 216. Row source tree 300 is
comprised of row sources. Table scan row source 310
performs a table scan on the employee table to generate rows
from the employee table. The output of table scan 310 is the
input of sort 306. Sort 306 sorts the input by department
number. Table scan row source 312 performs a table scan on
the department table to generate rows from the department
table. The output of table scan 312 is the input of sort 308.
Sort 308 sorts the input by department number.

The output from the two sort row sources (i.e., sort 306 
and sort 308) is the input to sort/merge join row source 304. 
Sort/Merge join 304 merges the input from the employee 
table (i.e., the input from sort 306) with the input from the
department table (i.e., the input from sort 308) by matching
up the department number fields in the two inputs. The result 
will become the output of sort/merge join 304 and the input 
of orderBy 302. OrderBy 302 will order the merged rows by 
the employee name. 
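
The serial plan of FIG. 3A can be traced with a short sketch. It is an illustration under stated assumptions rather than the patented implementation: the column names `ename`, `dname`, and `deptno`, the sample rows, and the function names are hypothetical, and tables are plain in-memory lists.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def sort_merge_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Merge two inputs already sorted on `key`, combining rows with equal keys."""
    out: List[Row] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the matching run with every right row in it.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def run_serial_plan(employee: List[Row], department: List[Row]) -> List[Row]:
    emp_sorted = sorted(employee, key=lambda r: r["deptno"])     # table scan 310 -> sort 306
    dept_sorted = sorted(department, key=lambda r: r["deptno"])  # table scan 312 -> sort 308
    joined = sort_merge_join(emp_sorted, dept_sorted, "deptno")  # sort/merge join 304
    return sorted(joined, key=lambda r: r["ename"])              # orderBy 302


employee = [{"ename": "Smith", "deptno": 20},
            {"ename": "Allen", "deptno": 10},
            {"ename": "Jones", "deptno": 20}]
department = [{"deptno": 10, "dname": "Sales"},
              {"deptno": 20, "dname": "Research"},
              {"deptno": 30, "dname": "Operations"}]
result = run_serial_plan(employee, department)
```

Each step consumes the complete output of the step below it, which is the bottom-up traversal of the row source tree that the serial plan implies.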

DATA FLOW OPERATORS 

A Query Coordinator (QC) assumes control of the pro- 
cessing of a query. The QC can also execute row sources that 
are to be executed serially. Additional threads of control are 
associated with the QC for the duration of the parallel 
execution of a query. Each of these threads is called a Query 
Server (QS). Each QS executes a parallel operator and 
processes a subset of the entire set of data, and produces a 
subset of the output data. The parallel operators that are 
executed by a QS are called data flow operators (DFOs). 
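
The division of labor between the QC and its query servers can be sketched roughly as below. This is an assumption-laden illustration: a thread pool stands in for the QS threads, the function names `run_parallel` and `scan_dept_20` are hypothetical, and splitting the rows by striding is an arbitrary choice made only for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def run_parallel(rows: List[Row],
                 operator: Callable[[List[Row]], List[Row]],
                 degree: int) -> List[Row]:
    """Stand-in for the QC: split the input, run one QS per subset, combine the outputs."""
    # Each subset is handed to exactly one query server (QS).
    subsets = [rows[i::degree] for i in range(degree)]
    with ThreadPoolExecutor(max_workers=degree) as pool:
        partial_outputs = list(pool.map(operator, subsets))
    # The QC sees the concatenation of the per-QS output subsets.
    return [row for part in partial_outputs for row in part]


def scan_dept_20(subset: List[Row]) -> List[Row]:
    """Example parallel operator: a filtering scan over one subset of the rows."""
    return [row for row in subset if row["deptno"] == 20]
```

Each call to the operator sees only its own subset of the data and produces only its own subset of the output, which mirrors the role of a QS described above.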

A DFO is a representation of row sources that are to be 
computed in parallel by query slaves. A DFO for a given 
query is equivalent to one or more adjacent row sources of 
that query's row source tree at the QC. Each DFO is a proper
subtree of the query's row source tree. A DFO is represented
as structured query language (SQL) statements. A DFO SQL
statement can be executed concurrently by multiple
processes, or query slaves. DFOs introduce parallelism into
SQL operations such as table scan, orderBy, group by, joins,
distinct, aggregate, unions, intersect and minus. A DFO can
be one or more of these operations. A DFO is converted back
into row sources at the query slaves via the normal SQL
parsing mechanism. No additional optimization is
performed when DFO SQL is processed by the slaves.

An SQL table scan scans a table to produce a set of rows 
from the relation, or table. A "group by" (groupBy) rear- 
ranges a relation into groups such that within any one group 
all rows have the same value for the grouping column(s). An 
"order by" (orderBy) orders a set of rows based on the
values in the orderBy column(s). A join joins two or more
relations based on the values in the join column(s) in the 
relations. Distinct eliminates any duplicate rows from the 
rows selected as a result of an operation. 
Aggregates compute functions aggregated over one or
more groups of rows. Count, sum and average aggregates,
for example, compute the cardinality, sum, and average of
the values in the specified column(s), respectively.
Maximum (i.e., Max) and minimum (i.e., Min) aggregates
compute the largest and smallest value (respectively) of the
specified column(s) among the group(s) of rows. A union
operation creates a relation consisting of all rows that appear
in any of two specified relations. An intersect operation
creates a relation that consists of all rows that appear in both
of two specified relations. A minus operation creates a
relation that consists of all rows that appear in the first but 
not the second of two specified relations. 
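
The operations defined above can be made concrete with a small sketch. It assumes rows are tuples held in Python lists; the function names are hypothetical and the code shows only the meaning of each operation, not how a DBMS would execute it.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def distinct(rows: Sequence[Row]) -> List[Row]:
    """Eliminate duplicate rows, keeping the first occurrence of each."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def union(a: Sequence[Row], b: Sequence[Row]) -> List[Row]:
    """All rows that appear in either relation."""
    return distinct(list(a) + list(b))


def intersect(a: Sequence[Row], b: Sequence[Row]) -> List[Row]:
    """All rows that appear in both relations."""
    in_b = set(b)
    return [row for row in distinct(a) if row in in_b]


def minus(a: Sequence[Row], b: Sequence[Row]) -> List[Row]:
    """All rows that appear in the first relation but not the second."""
    in_b = set(b)
    return [row for row in distinct(a) if row not in in_b]


def group_by(rows: Sequence[Row], column: int) -> Dict[Hashable, List[Row]]:
    """Rearrange rows into groups sharing the same value in the grouping column."""
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return dict(groups)


def aggregates(values: Sequence[float]) -> Dict[str, float]:
    """Count, sum, average, maximum, and minimum over one non-empty group of values."""
    return {"count": len(values), "sum": sum(values),
            "avg": sum(values) / len(values),
            "max": max(values), "min": min(values)}
```

For instance, grouping salary rows by department number with `group_by` and then passing each group's salary values to `aggregates` corresponds to a "group by" with count, sum, average, Max, and Min aggregates.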
PARTITIONING

Existing parallel query systems are implemented in a
shared nothing environment. FIG. 1B illustrates a shared
nothing hardware architecture. In a shared nothing
environment, each computer system is comprised of its own
resources (e.g., memory, central processing unit, and disk
storage). That is, a shared nothing environment is comprised
of one or more autonomous computer systems, and each
system processes its own data. For example, system one in
FIG. 1B is comprised of a central processing unit (i.e., CPU
1), memory (i.e., memory 1), and disk storage (i.e., disk
storage 1). Similarly, system n contains similar resources.

A DBMS implemented in a shared nothing environment
has an automatic partitioning scheme based on the physical
location of data. Therefore, partitioning, in a shared nothing
environment, is determined at the time the physical layout of
data is determined (i.e., at the creation of a database). Thus,
any partitioning in a shared nothing environment is static.

A scan of a table in a shared nothing environment
necessarily includes a scanning process at each autonomous
system at which the table is located. Therefore, the
partitioning of a table scan is determined at the point that the
location of data is determined. Thus, a shared nothing
environment results in a static partitioning scheme that
cannot dynamically balance data access among multiple
processes. Further, a shared nothing environment limits the
ability to use a variable number of scan slaves. A process, or
slave, running in system one of FIG. 1B can manipulate the
data that is resident on system one, and then transfer the
results to another system. However, the same process cannot
operate on data resident on another system (i.e., system two
through system n).

Thus, processes on each system can only process the data
resident on its own system, and cannot be used to share the
processing load at other systems. Therefore, some processes
can complete their portion of a scan and become idle while
other processes are still processing table scan tasks. Because
each system is autonomous, idle processes cannot be used to
assist the processes still executing a data access (e.g., table
scan) on other systems.

The present invention provides the ability to dynamically
partition data. The present invention can be implemented
using any of the hardware architectures (i.e., shared nothing,
shared disk, or shared everything). Further, the present
invention can be used in a software-implemented shared
disk system. A software-implemented shared disk system is
a shared nothing hardware architecture combined with a
high bandwidth communications bus and software that
allows blocks of data to be efficiently transmitted between
systems. Software implementation of a shared resource
hardware architecture reduces the hardware costs connected
with a shared resource system, and provides the benefits of
a shared resource system.

FIG. 1D illustrates a software-implemented shared disk
environment. System one through system n remain
autonomous in the sense that each system contains its own
resources. However, a communications bus connects the
systems such that system one through system n
can transfer blocks of data. Thus, a process, or slave, running
in system one can perform operations on data transferred
from another system (e.g., system n).

In addition to the software-implemented shared disk
environment, the present invention can be implemented in a
shared everything hardware architecture (illustrated in FIG.
1A), and a shared disk hardware architecture (illustrated in
FIG. 1C). In the shared resource environments (FIGS. 1A,
1C, and 1D), any data is accessible by any process (e.g.,
shared disk and shared everything). Thus, multiple central
processing units can access any data stored on any storage
device. The present invention provides the ability to
dynamically partition an operation (e.g., table scan) based on the
amount of data instead of the location of the data.
For example, the present invention provides the ability to
spread a table scan across "N" slaves to balance the load, and
to perform a table scan on a table such that each slave
finishes at virtually the same time. The present invention
determines an optimal number of slaves, "N", to perform an
operation. All "N" slaves can access all of the data. For
example, a table can be divided into three groups of "N"
partitions (i.e., "3N") using three groups of "N" ranges (i.e.,
"3N"). The ranges can be based on the values that identify
the rows (i.e., entries) in a table. Further, the "3N" partitions
are arranged based on size. Thus, there are "N" large
partitions, "N" medium-sized partitions, and "N" small
partitions. Each partition represents a partial execution of
the operation.

The larger groups of rowids are submitted to the "N"
slaves first. Each slave begins to process its rowid range. It
is possible for some processes to complete their tasks before
others (e.g., due to system resource fluctuations or variations
in the estimations of partition sizes). When a process completes a
partial execution, another set of rowid ranges can be
submitted to the process. Since all of the large partitions were
submitted to the "N" slaves at the start of a scan, faster
slaves receive a medium or small rowid range partial
execution. Similarly, as each slave completes its current rowid
range, additional rowid ranges can be submitted to the slave.
Because decreasing sizes of rowid ranges are submitted to
the faster slaves, all of the slaves tend to finish at virtually
the same time.

Partitioning Types

The present invention provides the ability to dynamically
partition using any performance optimization techniques.
For example, prior to the execution of an operation to sort
a table (i.e., order by), a sampling can be performed on the
table. From the results of the sampling, even
distributions of the rows can be identified. These
distributions can be used to load balance a sort between multiple
processes.

Some examples of partitioning include range, hash, and
round-robin. Range partitioning divides rows from an input
row source to an output row source based on a range of
values (e.g., logical row addresses or column value). Hash
partitioning divides rows based on hash field values.
Round-robin partitioning can divide rows from an input row source
to an output row source when value based partitioning is not
required. Some DFOs require outputs to be replicated, or
broadcast, to consumers, instead of partitioned.

PARALLELIZATION

A serial execution plan (e.g., FIG. 3A) provides a
non-parallelized representation of a plan for execution of a query.
In serial query processing, only one thread of control
processes an entire query. For example, the
employee table (i.e., table scan 310 in FIG. 3A)
is scanned sequentially. One process scans the
employee table.

The parallelism of the present invention provides the
ability to divide an execution plan among one or more
processes, or query slaves. Parallel query execution provides
the ability to execute a query in a series of parallel steps, and
to access data in parallel. For example, a table scan of the
employee table can be partitioned and processed by multiple
processes. Therefore, each process can scan a subset of the
employee table.

At compilation time, a sequential query execution plan is
generated. Then, the execution plan is examined, from the
bottom up, to determine those portions of the plan that can
be parallelized. Parallelism is based on the ability to
parallelize a row source. Further, the partitioning requirements of
consecutive row sources and the partitioning requirements
of the entire row source tree are examined. Further, the
present invention provides the ability for the SQL statement
to specify the use and degree of parallelism.

The present invention provides the ability to combine
parallelism and serialism in the execution of a query.
Parallelism may be limited by the inability to parallelize a row
source. Some row sources cannot be parallelized. For
example, an operation that computes row numbers must
allocate row numbers sequentially. When a portion of a row
source tree is encountered that cannot be implemented in
parallel, any portion below the serial row source is allocated
for parallelism. A parallelizer row source is allocated
between the serial row source and the parallelizable row
sources below the serial row source. The parallelizer row
source and the parallelizable row sources below it comprise
a DFO tree. The output of this DFO tree is then supplied as
input to the serial row source. A row source tree can contain
multiple DFO trees.

If, at execution, a given row source tree is implemented
using parallelism, the parallelizer row source can implement
the parallel processing of the parallelizable DFOs in the row
source tree for which it is the root node. If the row source
tree is implemented serially, the parallelizer row source
becomes invisible. That is, the rows produced by the row
sources in the DFO tree merely pass through the parallelizer
row source to the row sources above it in the row source tree.

The row source tree is examined to determine the
partitioning requirements between adjacent row sources, and the
partitioning requirements of the entire row source tree. For
example, the presence of an orderBy row source in a row
source tree requires that all value based partitioning in the
row source tree below the orderBy must use range
partitioning instead of hash partitioning. This allows the orderBy
to be identified as a DFO, and its operations parallelized,
since ordered partitioning of the orderBy DFO's output will
then produce correct ordered results.

FIG. 3B represents a parallel DFO tree corresponding to
the row source tree depicted in FIG. 3A. Sort 306 and sort
308 of FIG. 3A are combined with sort/merge join 304 to
distinguish the sort/merge join DFO of FIG. 3B. Referring
to FIG. 3B, slave DFOs 324A-324C perform the sort and
join operations. The output of slave DFOs 330A-330C is
transmitted to slave DFOs 324A-324C.

Table scan 332 scans a table (i.e., department table) that
[...] application of parallelism to the
department table may not improve
performance. Therefore, table scan 332 can be implemented
as a non-parallel scan of the department table. The output of
table scan 332 is transmitted to slave DFOs
324A-324C.

An SQL statement can specify the degree of parallelism
to be used for the execution of constituent parts of an SQL
statement. Hints incorporated in the syntax of the statement
can be used to affect the degree of parallelism. For example,
an SQL statement may indicate that no amount of
parallelism is to be used for a constituent table scan. Further, an SQL
statement may specify the maximum amount of partitioning
implemented on a table scan of a given table.

TABLE QUEUES

Some DFOs function correctly with any arbitrary
partitioning of input data (e.g., table scan). Other DFOs require
a particular partitioning scheme. For example, a group by
DFO needs to be partitioned on the grouping column(s). A
sort/merge join DFO needs to be partitioned on the join
column(s). Range partitioning is typically chosen when an
orderBy operation is present in a query. When a given child
DFO produces rows in such a way as to be incompatible
with the partitioning requirements of its parent DFO (i.e., the
DFO consuming the rows produced by a child DFO), a table
queue is used to transmit rows from the child to the parent
DFO and to repartition those rows to be compatible with the
parent DFO.

The present invention uses a table queue to partition and
transport rows between sets of processes. A table queue (TQ)
encapsulates the data flow and partitioning functions. A TQ
partitions its input to its output according to the needs of the
consumer DFO and/or the needs of the entire row source
tree. The table queue row source synchronously dequeues
rows from a table queue. A TQ connects the set of producers
on its input to the set of consumer slaves on its output.

An orderBy operation orders the resulting rows (i.e., the A TQ provides data flow directions. ATQ can connect a 

output from the executed plan) according to the orderBy QC to a QS. For example; a QC may perform a table scan 

criteria contained in the SQL statement represented by the on a small table and transmit the result to a table queue that 

execution plan. To parallelize an orderBy operation, the 50 distributes the resulting rows to one or more QS threads. The 

query slaves that implement the operation each receive rows table queue, in such a case, has one input thread and some 

with a range of key values. Each query can then order the number of output threads equaling the number of QS 

rows within its range. The ranges output by each query slave threads. A table queue may connect some number, N, of 

(i.e., the rows ordered within each range) can then be query slaves to another set of N query slaves. This table 

concatenatedbasedontheorderBy criteria- Each query slave 55 queue has N input threads and N output threads. A table 

implementing the orderBy operation expects row sources queue can connect a QS to a QC. For example, the root DFO 

that fall within the range specification for that query slave. in a DFO tree writes to a table queue that is consumed by the 

Thus, the operations performed prior to the orderBy opera- QC. This type of table queue has some number of input 

tion can be performed using range partitioning to facilitate threads and one output thread 

the direction of the rows according to the range specifica- 60 FIG. 4 illustrates table queues using the parallel execution 

tion. plan of SQL statement 216 in FIG. 2. Referring to FIG. 3B, 

FIG. 3B illustrates a row source tree in which parallelism the output of the table scans 330A-330C becomes the input 

has been introduced. Table scan 310 in FIG. 3A is processed of sort/merge join DFOs 324A-324C. A scan of a table can 

by a single process, or query slave. In FIG. 3B, table scan be parallelized by partitioning the table into subsets. One or 

310 is partitioned into multiple table scans 330A-330C. 65 more subsets can be assigned to processes until the maxi- 

That is, the table scan of the employee table is processed by mum number of processes are utilized, or there are no more 

multiple process, or query slaves. subsets. 
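
The subset assignment described for a parallel table scan can be illustrated with a short Python sketch. The function names, the fixed subset size, and the round-robin policy are assumptions made for illustration; the specification does not prescribe how subsets are formed or handed out.

```python
def split_into_subsets(row_count, subset_size):
    """Return contiguous (first_row, last_row) ranges covering rows 1..row_count."""
    return [(lo, min(lo + subset_size - 1, row_count))
            for lo in range(1, row_count + 1, subset_size)]


def assign_subsets(subsets, max_processes):
    """Hand subsets to processes round-robin (an assumed policy).

    No more processes are used than there are subsets, so assignment stops
    growing once either the process limit or the subsets run out.
    """
    n = min(max_processes, len(subsets))
    assignment = {p: [] for p in range(n)}
    for i, subset in enumerate(subsets):
        assignment[i % n].append(subset)
    return assignment
```

With ten rows, a subset size of three, and at most three processes, the first process receives two subsets and the other two receive one each.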
While an ordering requirement in an SQL statement may
suggest an optimal partitioning type, any partitioning type
may be used to perform a table scan because of the shared
resources (e.g., shared disk) architecture. A table queue can
be used to direct the output of a child DFO to its parent DFO
according to the partitioning needs of the parent DFO and/or
the entire row source tree. For example, table queue 406
receives the output of table scan DFOs 402A-402C. Table
queue 406 directs the table scan output to one or more
sort/merge join DFOs 410A-410C according to the
partitioning needs of DFOs 410A-410C.
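
As a minimal sketch of how a table queue might route producer rows to consumers according to the consumer's partitioning needs, consider the following Python class. The class name, its attributes, and the use of in-memory lists are hypothetical; the specification does not describe the table queue at this level of detail.

```python
import bisect


class TableQueue:
    """Routes each producer row to exactly one consumer partition."""

    def __init__(self, n_consumers, key, partitioning, boundaries=()):
        if partitioning not in ("hash", "range"):
            raise ValueError("unsupported partitioning type")
        if partitioning == "range" and len(boundaries) != n_consumers - 1:
            raise ValueError("range partitioning needs n_consumers - 1 boundaries")
        self.n = n_consumers
        self.key = key                      # extracts the partitioning column(s)
        self.partitioning = partitioning
        self.boundaries = list(boundaries)  # sorted upper bounds, range only
        self.outputs = [[] for _ in range(n_consumers)]

    def _target(self, row):
        k = self.key(row)
        if self.partitioning == "hash":
            # Equal keys always hash to the same consumer.
            return hash(k) % self.n
        # Range: consumer i receives keys between boundary i-1 and boundary i.
        return bisect.bisect_right(self.boundaries, k)

    def put(self, row):
        self.outputs[self._target(row)].append(row)
```

A join or group by consumer would use hash partitioning on the join or grouping column, while a consumer beneath an orderBy would use range partitioning on the sort key.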

In some instances, there is virtually no benefit in using
parallel processing (e.g., table scan of a table with few
rows). Referring to FIG. 4, table scan 412 of a small table
(i.e., department table) is not executed in parallel. In the
preferred embodiment, a table scan performed by a single
process is performed by QC 432. Thus, the input to table
queue 416 is output from QC 432. Table queue 416 directs
this output to the input of DFOs 410A-410C. Table queue
416 connects QC 432 to QS slave DFOs 410A-410C. The
input from table queues 406 and 416 is used by DFOs
410A-410C to perform a sort/merge join operation.

The output of DFOs 410A-410C is transmitted to table
queue 420. Table queue 420 directs the output to DFOs
424A-424C. The existence of an orderBy requirement in an
SQL statement requires the use of a type of range
partitioning for table queue 420, and range partitioning is
suggested for TQs 406 and 416. Range partitioning will result in
row partitions divided based on sort key value ranges. In the
present example, SQL statement 216 in FIG. 2 specified an
order in which the selected rows should be provided (i.e.,
ordered by employee name). Therefore, range partitioning is
the partitioning scheme to execute SQL statement 216 in
parallel. Thus, table queue 420 can direct a set of rows to
each of the query slaves executing DFOs 424A-424C based
on a set of ranges. Range partitioning can be used to divide
the rows, by value ranges, between the query slaves
processing the rows.
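
The reason range partitioning suits an orderBy can be shown with a small Python sketch: each slave sorts only the rows in its own key range, and concatenating the per-range results in range order yields a fully ordered result. The function name and the boundary values are assumptions for illustration, and the "slaves" here are simply list partitions processed in turn.

```python
import bisect


def parallel_order_by(rows, key, boundaries):
    """Range-partition rows on key, sort each partition, then concatenate."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # Route the row to the partition whose key range contains it.
        partitions[bisect.bisect_right(boundaries, key(row))].append(row)
    # Each partition stands in for one query slave sorting its own range.
    ordered = [sorted(p, key=key) for p in partitions]
    # Ranges are disjoint and ascending, so concatenation is globally ordered.
    return [row for part in ordered for row in part]
```

With hash partitioning the same concatenation step would not be ordered, because every partition could contain keys from anywhere in the key space.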

DFO SQL 

A DFO is represented as structured query language (SQL)
statements. For example, block 216 in FIG. 2 illustrates a
selection operation from employee and department tables. A
selection operation includes a scan operation of these tables.
The DFO SQL for the employee table scan is:

select /*+rowid(e)*/ deptno c1, empname c2
from emptable
where rowid between :1 and :2

The ":1" and ":2" are rowid variables that delimit a rowid
range. Actual rowid values are substituted at the beginning
of execution. As each slave completes the scanning of a
rowid range (i.e., completion of a partial execution),
additional rowid values are substituted at each subsequent partial
execution. The scan produces the department field and
employee name values.
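
The repeated substitution of rowid bounds can be tried out with Python's built-in sqlite3 module, since ordinary SQLite tables also expose a rowid column. This is only an analogy under stated assumptions: the `?` placeholders stand in for the ":1" and ":2" variables, the table contents are made up, the hint comment is omitted, and the helper name `scan_in_ranges` is hypothetical.

```python
import sqlite3

# Same shape as the table scan DFO SQL; "?" placeholders play the role of :1 and :2.
SCAN_DFO = "select deptno c1, empname c2 from emptable where rowid between ? and ?"


def scan_in_ranges(conn, rowid_ranges):
    """Run one partial execution per rowid range and collect the rows."""
    rows = []
    for lo, hi in rowid_ranges:
        rows.extend(conn.execute(SCAN_DFO, (lo, hi)).fetchall())
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("create table emptable (deptno integer, empname text)")
conn.executemany(
    "insert into emptable values (?, ?)",
    [(10, "Adams"), (20, "Baker"), (10, "Jones"), (30, "Smith")],
)

# Two partial executions that together cover rowids 1 through 4.
all_rows = scan_in_ranges(conn, [(1, 2), (3, 4)])
```

Each (lo, hi) pair corresponds to one partial execution; in the scheme described above, different slaves would be given different pairs.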

The DFO SQL statement above illustrates extensions of
SQL that provide the ability to represent DFOs in a precise
and compact manner, and to facilitate the transmission of the
parallel plan to multiple processes. One extension involves
the use of hints in the DFO SQL statement that provide the
ability to represent a DFO in a precise and compact way. In
addition to the hint previously discussed to specify the use
and/or degree of parallelism, the present invention provides
the ability to incorporate hints in a DFO SQL statement to
specify various aspects of the execution plan for the DFO
SQL statement. For example, in the previous DFO SQL
statement, the phrase "/*+rowid(e)*/" provides a hint as to
the operation of the table scan DFO (i.e., use rowid
partitioning). Other examples are: "full" (i.e., scan entire
table), "use_merge" (i.e., use a sort/merge join), and
"use_nl" (i.e., use a nested loop join).

Another extension provides the ability to use and
reference table queues. The output of the employee table scan is
directed to a table queue (e.g., Q1) as illustrated in FIG. 4.
The contents of table queue Q1 become the input to the next
operation (i.e., sort/merge). The DFO SQL statement assigns
aliases for subsequent references to these fields. The DFO
statement further creates a reference for the columns in the
resulting table queue (i.e., "c1" and "c2"). These "aliases"
can be used in subsequent SQL statements to reference the
columns in any table queue.

A second table scan is performed on the department table.
As illustrated previously, because the department table is
small (i.e., a lesser number of table entries), the department
table scan can be performed serially. The output of the
department table scan is directed to the Q0 table queue. The
contents of the Q0 table queue become the input to the
sort/merge operation.

The DFO SQL for the sort/merge operation is: 



select /*+use_merge(a2)*/ a1.c2, a2.c2
from :Q1 a1, :Q0 a2
where a1.c1 = a2.c1



The sort/merge DFO SQL operates on the results of the
employee table scan (i.e., Q1 table queue, or "a1"), and the
results of the department table scan (i.e., Q0 table queue, or
"a2"). The output of the sort/merge join DFO is directed to
table queue Q2 as illustrated in FIG. 4. The contents of table
queue Q2 become the input to the next operation (i.e.,
orderBy). The DFO SQL for the orderBy operation is:

select c1, c2 from :Q2 order by c1

The orderBy operation orders the results of the sort/merge
join DFO. The output of the orderBy operation is directed to
the requester of the data via table queue Q3.

COMBINED DFOs 

If the partitioning requirements of adjacent parent-child
DFOs are the same, the parent and child DFOs can be
combined. Combining DFOs can be done using the SQL
mechanism. For example, a reference to a table queue in a
SQL statement (e.g., Qn) is replaced with the SQL text that
defines the DFO. For example, if block 216 in FIG. 2
specified "order by deptNo," the sort/merge join and the
orderBy operations can be combined into one DFO SQL.
Thus, the first two statements can be combined to be
statement three:



1. select /*+ordered use_merge(a2)*/ a1.c2, a2.c2, a2.c2
from :Q1 a1, :Q0 a2
where a1.c1 = a2.c1

2. select c2, c3 from :Q2 order by c1

3. select c2, c3
from (select /*+ordered use_merge(a2)*/ a1.c1 c1, a1.c2 c2, a2.c2 c3
from :Q1 a1, :Q0 a2
where a1.c1 = a2.c1)
order by c1
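
The textual substitution that combines a parent and child DFO can be sketched in Python as a plain string rewrite. The function name `inline_table_queue` is hypothetical, and a real implementation would presumably operate on a parsed statement rather than on raw text.

```python
import re


def inline_table_queue(parent_sql, queue_name, child_sql):
    """Replace the reference :queue_name in parent_sql with the child's SQL text.

    The child statement is wrapped in parentheses so that it becomes an
    inline view in the parent's FROM clause.
    """
    pattern = re.compile(":" + re.escape(queue_name) + r"\b")
    # A function replacement avoids any backslash interpretation of child_sql.
    return pattern.sub(lambda match: "(" + child_sql + ")", parent_sql)


join_dfo = ("select /*+ordered use_merge(a2)*/ a1.c1 c1, a1.c2 c2, a2.c2 c3 "
            "from :Q1 a1, :Q0 a2 where a1.c1 = a2.c1")
order_dfo = "select c2, c3 from :Q2 order by c1"

combined = inline_table_queue(order_dfo, "Q2", join_dfo)
```

Here `combined` has the same shape as statement three: the reference to Q2 is gone, and the join statement appears in its place as a parenthesized subquery.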



PLAN ANNOTATIONS

During the compilation and optimization process, each
node in the row source tree is annotated with parallel data
flow information. FIG. 6A provides an example of
parallelism annotation information. If the node is a DFO, the type
of DFO is retained (e.g., table scan, sort/merge join, distinct,
and orderBy). If the node is a serial row source to be
processed by the QC, the table queue to which the QC
outputs the rows generated from the execution of the row
source is stored with the other information associated with
the row source. A node that represents a DFO also contains
information regarding the DFO.

The number of query slaves available at the time of
execution affects the degree of parallelism implemented.
The number of available processes may be affected by, for
example, quotas, user profiles, or the existing system
activity. The present invention provides the ability to implement
any degree of parallelism based on the number of query
slaves available at runtime. If enough query slaves are
available, the degree of parallelism identified at compile
time can be fully implemented. If fewer query slaves are
available than the number needed to fully implement the
degree of parallelism identified at compile time, the present
invention provides the ability to use the available query
slaves to implement some amount of parallelism. If the
number of available query slaves dictates that the query be
implemented serially, the present invention retains the row
source equivalent for each node. Thus, the present invention
provides the ability to serially implement a query
parallelized at compile time.
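
One way to picture the runtime decision is the following Python sketch. It assumes, purely for illustration, that a parallel plan needs two slave sets of equal size and that the achievable degree is limited by the slaves available per set; the function name and the exact fallback rule are not taken from the specification.

```python
def plan_execution(planned_degree, available_slaves, slave_sets=2):
    """Pick the degree of parallelism that can actually be run.

    Returns ("parallel", degree) when at least one slave per slave set is
    available, otherwise ("serial", 0) to signal that the retained serial
    row source tree should be used instead.
    """
    degree = min(planned_degree, available_slaves // slave_sets)
    if degree < 1:
        return ("serial", 0)
    return ("parallel", degree)
```

Under these assumptions, a plan compiled for degree 4 runs at degree 4 with eight free slaves, at degree 2 with five, and serially with only one.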

If the node is implemented by the QC, the output table
queue identifier is included in the node information. If the
node is not implemented by the QC, the pointer to the first
child of the parallelized node, the number of key columns in
the input table queue, the parallelized node's partitioning
type, and the number of columns clumped with parent are
included in the node information.

If the node represents a table scan DFO, the information 
includes table scan information such as table name and 
degree of parallelism identified for the scan. If the DFO is 
an indexed, nested loop join, the information includes the 
right and left input table names. If the DFO is a sort/merge 
join, the information includes two flags indicating whether 
the operation is a merge join or an outer join. If the DFO 
represents an index creation, the information includes a list 
of columns included in the index, the index type, and storage 
parameters. 

At the time of implementation, information describing the
DFOs is sent to the query slaves implementing the DFOs.
All DFOs of an even depth are sent to one slave set. All
DFOs of an odd depth are sent to the other slave set. Depth
is measured from the top (root) node of the tree. FIG. 6B
provides an example of information sent to query slaves.
This information includes a pointer to the next DFO for the
slave set to execute. The next-to-execute pointer points to
the next DFO at the same depth, or, if the current DFO is the
last at its depth, the pointer points to the leftmost DFO in the
tree at depth-2. The next-to-execute pointer links the DFOs
not implemented by the QC into a set of subtrees, or lists.

Using the next-to-execute pointer, a row source tree can
be split into two DFO lists that can be executed by two sets
of query slaves. The DFOs executed by a first set of query
slaves are given by a list starting with the leftmost leaf of the
DFO tree and linked by the next-to-execute pointers. The
DFOs executed by a second set of query slaves are given by
the list starting with the parent of the leftmost leaf and linked
by another set of sibling pointers.
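
The two DFO lists can be sketched in Python as follows. The `Dfo` class and both function names are hypothetical, the lists are returned as plain Python lists rather than as linked next-to-execute pointers, and the sketch assumes that the leftmost leaf lies at the deepest level of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Dfo:
    name: str
    children: list = field(default_factory=list)


def levels(root):
    """Group DFOs by depth (root at depth 0), left to right within a depth."""
    result, frontier = [], [root]
    while frontier:
        result.append(frontier)
        frontier = [child for node in frontier for child in node.children]
    return result


def execution_chains(root):
    """Return the two execution lists, one per slave set.

    The first list starts at the deepest level and then moves up two levels
    at a time; the second starts one level higher, so depths of the same
    parity always end up in the same list.
    """
    by_depth = levels(root)
    max_depth = len(by_depth) - 1
    chains = []
    for start in (max_depth, max_depth - 1):
        chain = []
        for depth in range(start, -1, -2):
            chain.extend(dfo.name for dfo in by_depth[depth])
        chains.append(chain)
    return chains


tree = Dfo("order_by", [Dfo("join", [Dfo("scan_emp"), Dfo("scan_dept")])])
first_set, second_set = execution_chains(tree)
```

For this three-level tree, the first slave set runs both scans and later the orderBy, while the second slave set runs the join in between.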

The present invention can be implemented without a
central scheduling mechanism. In such a case, all of the
slaves needed to implement the DFOs are implemented at
the start of execution of the row source tree. However, many
of the slaves must wait to begin processing (i.e., remain idle)
until other slaves supply data to them.

In the preferred embodiment of the present invention, a
central scheduling mechanism is used to monitor the
availability of data, and to start slaves as the data becomes ready
for processing by the slaves. Therefore, the only slaves that
are started are those that can begin processing immediately
(i.e., leaf nodes), and those slaves that must be executing to
receive data from the leaf nodes. This technique of invoking
only those slaves that are producing or consuming rows
provides the ability to minimize the number of query slaves
needed to implement parallelism.

For example, a first set of query slaves can be used to
produce rows for a second set of query slaves. Once the first
set (i.e., the producing set of query slaves) completes its task
of producing rows, the set can be used to implement the
DFOs that consume the output from the second set of query
slaves. Once the second set of slaves completes its task of
producing rows for the first set, the set can be used to
implement the level of the tree that receives input from the
first set. This technique of folding the DFO tree around two
slave sets minimizes the number of slaves needed to
implement a tree. As the depth of the tree increases, the
savings in processing power increases. Further, this
technique provides the ability to implement an arbitrarily
complex DFO tree.

FIG. 3C illustrates a row source tree divided into thirds
(i.e., Sets A-C) by lines 340 and 342 representing the levels
of the tree that can be implemented by one set of query
slaves. For example, Set A includes DFOs 330A-330C and
DFOs 344A-344C. These DFOs can be processed by a first
slave set (i.e., slave set A).

The query slaves in slave set A perform table scans on an
employee table and a department table. The rows generated
by these table scans are the output of slave set A. The output
of slave set A becomes the input of the query slaves in set
B. Thus, the query slaves in set B must be ready to receive
the output from slave set A. However, the query slaves
implementing the operations in set C do not have to be
invoked until slave set B begins to generate output. Slave set
B must sort and merge the rows received from slave set A.
Therefore, output from slave set B cannot occur until after
slave set A has processed all of the rows in the employee and
department tables. Therefore, once slave set A finishes
processing the DFOs in set A, slave set A is available to
implement the DFOs in set C. Therefore, the implementation
of tree 350 only requires two slave sets (slave set A and B).

Referring to FIG. 6B, information sent to query slaves
includes the output TQ identifier, the number of
rowid-partitioned tables, the size of the SQL statement representing
the DFO, the SQL statement representing the DFO, and flags
that define runtime operations (e.g., slave must send
"Started" message, slave sends "Ready" message when
input consumed, and close slave expects to be closed upon
completion).

Additional row sources facilitate the implementation of
the parallelism of the present invention. These include
parallelizer, table queue, table access by partition, and index
creation row sources. An index creation row source
assembles sub-indices from underlying row sources. The
sub-indices are serially merged into a single index. Row
sources for table and index scanning, table queues, and
remote tables have no underlying row sources, since they
read rows directly from the database, a table queue, or a
remote data store.

A table queue is a mechanism for partitioning and
transporting rows between sets of processes. The input TQ
function of a table queue is determined by the partitioning
type of the parent DFO. The following are examples of some
considerations that can be used to determine the type of TQ
partitioning:

1. The inputs to a DFO must be hash partitioned, if the
DFO requires value partitioning (e.g., a sort/merge join or
group by), there is no orderBy in the DFO tree, and the DFO
is not a nested loop join;

2. The inputs to a DFO must be range partitioned, if the
DFO requires value partitioning (e.g., a sort/merge join or
group by), there is an orderBy in the DFO tree, and the DFO
is not a nested loop join;

3. If the DFO is a nested loop join, one input must be
arbitrarily partitioned and the other input must access all of
the input data either by using a broadcast TQ or a full table
scan;

4. When rows are returned to the QC, partitions must be
returned sequentially and in order, if the statement contains
an orderBy. Otherwise, the rows returned from the partitions
can be interleaved.

DATA FLOW SCHEDULER

The parallelizer row source (i.e., data flow scheduler)
implements the parallel data flow scheduler. A parallelizer
row source links each DFO to its parent using a TQ. If
parallelism cannot be implemented because of the
unavailability of additional query slaves, the parallelizer row source
becomes invisible, and the serial row source tree is
implemented. In this instance, the parallelizer is merely a conduit
between the underlying row source and the row source to
which the parallelizer is the underlying row source. In
general, row sources are encapsulated and, therefore, do not
know anything about the row sources above or below them.

PARALLELIZER ALLOCATION

At compilation, when you reach a row source that is the
top of a DFO tree, or is directly below a portion of the row
source tree that cannot be parallelized, a parallelizer row
source is allocated between the top of the DFO tree and
below the serial portion of the row source tree. FIG. 8
illustrates a row source tree including parallelizer row
sources. Parallelizer 808 is allocated between DFO subtree
810 and serial row source 806. Parallelizer 812 is allocated
between DFO subtree 812 and serial row source tree 804.

FIG. 10A provides an Allocate Parallelizer process flow.
Processing block 1002 gets the root DFO in the DFO tree
and initializes flags. At processing block 1004, the number
of table instances scanned is determined. At processing
block 1006, the number of table queues is determined. The
number of table queues receiving rows from serially
processed nodes is determined at processing block 1008.

At decision block 1010 (i.e., "orderBy in query?"), if an
orderBy is present in the SQL statement being processed, an
orderBy flag is set, and processing continues at decision
block 1014. If an orderBy is not present in the SQL
statement, processing continues at decision block 1014. At
decision block 1014 (i.e., "close message needed?"), if a
close message must be sent to the slaves, a close flag is set,
and processing continues at processing block 1018. If no
close message is needed, processing continues at processing
block 1018.

At processing block 1018, redundant columns that are not
key columns are eliminated from the SQL statement(s). The
start and ready synchronization requirements (i.e., whether
slaves need to communicate started and ready states to the
data flow scheduler) are determined and retained at block
1020. At processing block 1022, the maximum depth of the
tree is determined by examining the tree. At 1024,
TreeTraversal is invoked to traverse the DFO tree for which the
current parallelizer row source is being allocated. Processing
ends at processing block 1026.

TreeTraversal is invoked to further define the execution
environment for a DFO tree. FIGS. 10B and 10C provide an
example of the process flow for TreeTraversal. At processing
block 1032, the table queue identifier (TQ ID) is initialized
to zero, and the starting TQ ID for parallel DFOs is
determined. At decision block 1034 (i.e., "all nodes processed?"),
if the tree has been traversed, processing returns to
AllocateParallelizer at block 1036. If the traversal is not
complete, processing continues at block 1038. The first, or
next node in the execution order is identified at processing
block 1038.

At processing block 1040, the TQ connection code (i.e.,
[...] or from QC to slave set 2, or from slave set 2 to QC) is
determined, and the TQ's partitioning type is determined. At
processing block 1044, a TQ ID is assigned to the TQ, and
the TQ ID counter is incremented. At decision block 1046
(i.e., "table scans?"), if there are no table scans in the DFO,
processing continues at decision block 1046. If there are
table scans, the number of distinct tables scanned is
determined, and the index of distinct tables for this DFO is
allocated and initialized at processing block 1046.
Processing continues at decision block 1050.

At decision block 1050 (i.e., "node to be executed by
slave set 1 or slave set 2?"), if the node is executed by slave
set 1, processing continues at decision block 1052. At
decision block 1052 (i.e., "node first in execution chain 1?"),
if the node is the first to be executed in the first chain, this
node is set as the current node at processing block 1054, and
processing continues at block 1058. If the node is not the
first to be executed, the next node pointer of the previous
node in this chain is set to point to the current node at
processing block 1056, and processing continues at block
1058.

If, at decision block 1050, the node is to be executed by
slave set 2, processing continues at decision block 1072. At
decision block 1072 (i.e., "node first in execution chain 2?"),
if the node is the first to be executed in the second chain, this
node is set as the current node at processing block 1074, and
processing continues at block 1058. If the node is not the
first to be executed, the next node pointer of the previous
node in this chain is set to point to the current node at
processing block 1076, and processing continues at block
1058.

At processing block 1058, the partitioning type for the TQ
is determined. At processing block 1060, the table queue
format is initialized. At processing block 1062 the table
queue descriptor is allocated. At processing block 1062, the
table queue descriptor contains information regarding the
TQ including the TQ ID, partitioning type, and connection
code. The SQL for the DFO is generated at processing block
1064. Processing continues at decision block 1034 to
process any remaining nodes of the tree.

PARALLELIZER INITIATION

After an SQL statement is compiled and an execution plan
is identified, the SQL statement can be executed. To execute
an SQL statement, execution begins from the top of the row
source tree. From the root down, each node is told to
perform one of its operations (e.g., open, fetch, or close). As
each node begins its operations, it must call upon its
underlying nodes to perform some prerequisite operations.
As the tree is traversed in this manner, any parallelizer row
sources that are encountered are called upon to implement
its functionality (i.e., start).

Operations (e.g., fetching rows from DBMS) can be
performed more than once. This results in multiple calls to
a parallelizer. When a parallelizer is called after a first call
to the parallelizer, the parallelizer must be able to determine
the state of the slaves implementing the underlying DFO tree
(e.g., the state of the slaves, what DFOs are running).
StartParallelizer, illustrated in FIGS. 18A and 18B, provides
an example of the steps executed when a parallelizer row
source is called.

At block 1802, flags are initialized (e.g., opened, started,
no row current, and not end of fetch). At decision block 1804
(i.e., "restart with work in progress?"), if the parallelizer was
not restarted with work in progress, processing continues at
block 1808. Processing continues at block 1808 to set the
maximum number of slaves to the maximum number of
slaves allowed (i.e., based on a system's limitations) per
query.

At decision block 1810 (i.e., "rowid ranges set?"), if
rowid ranges are set, processing continues at block 1814. If
the rowid ranges have not been set, processing continues at
block 1812 to allocate rowid ranges per slave, and
processing continues at block 1814. At processing block 1814, the
rowid ranges and the slave processes to implement the
underlying DFO tree are allocated. At decision block 1816
(i.e., "any slaves available?"), if no slaves are available for
allocation to perform the parallelism of the underlying DFO
tree, processing continues at block 1834 to clear flags in
output TQ, and at 1836 to start the underlying serial row
source. Thus, where system limitations do not permit any
parallelism, the parallelizer initiates the serial row source
tree to implement the functionality of the parallel DFO tree.
Processing returns at block 1834.

If some amount of parallelism is available, processing
continues at decision block 1818. At decision block 1818
(i.e., "first execute?"), if this is the first execution of the
parallelizer, processing continues at block 1820 to initialize
working storage (e.g., allocate variable length items from
the cursor work heap, allocate and initialize bind value
pointers, allocate and initialize TQ data structures, allocate
SMJ TQ consumer bit vector, and allocate partial execution
bit vector). Processing continues at decision block 1822.

If this is not the first execution of the parallelizer,
processing continues at decision block 1822. At decision block
1822 (i.e., "SQL statement parsing necessary?"), if the
parsing is required, processing continues at block 1824 to
compile and bind DFO SQL statement at all of the slaves.
Processing continues at block 1826. If parsing is not
necessary, processing continues at block 1826.

At block 1826, the current node is set to the first node to
be executed (i.e., the bottom-most left-most node of the
DFO tree). At block 1828, the current node's and its
parent's slave count is set to zero, the current node's and its
parent's state is set to NULL. At block 1830, the TQ bit
vector is set, the partial execution bit vector is cleared, and
the row counter is set to zero. At 1832, Start is invoked to
start the current DFO. Processing ends at block 1834.

Start Node

At various stages of implementation of a DFO tree, the
parallelizer (i.e., data flow scheduler) traverses the DFO
tree, using the DFO tree pointers, to find the next node to
implement. When a node is identified that is not already
started, the parallelizer starts the node. FIG. 15 illustrates a
process flow for Start.

At decision block 1502 (i.e., "Nodes serially
processed?"), processing continues at block 1504. At block
1504, the node is started. At block 1506, the fact that no
ready message is needed is indicated (i.e., slaves will
continue to process without ready synchronizations from the
parallelizer). The counter is set to the number of slaves
implementing the node at block 1508. Processing continues
at block 1510.

If, at decision block 1502, parallelism can be used to
implement the node, processing continues at block 1520. At
block 1520, the slave counter is set to zero. At decision block
1522 (i.e., "start confirmation needed?"), if it is determined
that a start confirmation is necessary, a flag is set to mark the
state as "Not Started" at block 1524, and processing
continues at block 1510.

If no start confirmation is needed, processing continues at
block 1526 to mark state as already started. At decision
block 1528 (i.e., "ready confirmation needed?"), if ready
confirmation is needed, processing continues at block 1510.
If it is not needed, the state is marked as already ready, and
processing continues at block 1510.

At block 1510, an initial rowid range of each parallel table
scan is obtained for each slave implementing the current
DFO. At block 1512, an execution message is sent to all of
the slaves that are implementing the current node. At block
1514, the current node is marked as started. Processing
returns at block 1516.

SORCERER'S APPRENTICE

The present invention provides the ability to eliminate
needless production of rows (i.e., the sorcerer's apprentice
problem). In some cases, an operation is dependent on the
input from two other operations. If the result of the first input
operation does not produce any rows, there is no need for the
second input generator to produce any rows. However,
unless these input generators are aware of the fact that there
is no need to continue processing, they will execute their
operations.

For example, a sort/merge join operation is dependent on
the output of two separate underlying operations. If the
execution of the first underlying operation does not produce
any rows, there is no need to execute any remaining
operations in the sort/merge join task. However, unless the
processes executing the remaining underlying input are
aware of the fact that there is no need to continue processing,
they will continue to process despite the fact that there is no
need to continue.

This problem is further complicated when multiple
processes are involved (e.g., multiple slaves performing the first
table scan) because some of the processes may produce rows
while others do not produce rows. Therefore, it is important
to be able to monitor whether any rows are produced for a
given consumer. The producers of the rows can't be used to
perform the monitoring function because the producers are
not aware of the other producers or where the rows are
going. Therefore, the consumer of the rows (i.e., the
sort/merge join processes) must monitor whether any rows are
received from the producers.

A bit vector is used to indicate whether each consumer
process received any rows from any producer slaves. Each
consumer is represented by a bit in the bit vector. When all
of the end of fetch ("eof") messages are received from the
producers of a consumer slave, the consumer sends a done message to the data flow scheduler. The data flow scheduler determines whether the consumer slave received any rows, and sets the consumer's bit accordingly. The bit in the bit vector is used by subsequent producers to determine whether any rows need to be produced for any of its consumers. The bit vector is reset at the beginning of each level of the tree.

FIG. 9 illustrates a three way join. Employee table scan is implemented by slave DFOs 902A-902C in the first slave set. Rows produced by slave DFOs 902A-902C in the first set are used by the second slave set implementing the first sort/merge join (i.e., slave DFOs 906A-906C, respectively). The second set of input to sort/merge join slave DFOs 906A-906C is generated by department table scan slave DFOs 904A-904C in the first set, respectively. As slave DFOs 902A-902C complete, the sorcerer's apprentice bit vector is set to indicate whether any or none of slave DFOs 902A-902C produced any rows. If none of these slave DFOs produced any rows, there is no need to continue processing. Further, if slave DFOs 902A-902C did not produce any rows for consumer slave DFO 906C, there is no need for slave DFOs 904A-904C to send any output to consumer slave DFO 906C. Therefore, subsequent slave processes (e.g., 904C, 906C, 908C, or 910C) can examine the bit vector to determine what consumer slave DFOs should be serviced with input. The bit vector is updated to reflect a subsequent consumer slave's receipt (or lack thereof) of rows from their producer slaves, and examined by subsequent producer slave processes to determine whether to process rows for their consumer slaves.

PARALLELIZER EXECUTION

After a parallelizer has been initiated, its operations include synchronizing the parallel execution of the DFO tree. It allocates the DFOs in the DFO tree to the available slaves and specifies table queue information where appropriate. Like other row sources, the parallelizer row source can perform open, fetch, and close operations.

The data flow scheduler keeps track of the states of two DFOs at a time (i.e., the current DFO and the parent of the current DFO). As the slaves asynchronously perform the tasks transmitted to them by the dataflow scheduler, they transmit state messages to the dataflow scheduler indicating the stages they reach in these tasks. The data flow scheduler tracks the number of slaves that have reached a given state, and the state itself. The counter is used to synchronize the slaves in a slave set that are performing a DFO. The state indicates the states of slaves implementing a DFO. For example, a started state indicates that a slave is started and able to consume rows. A ready state indicates that a slave is processing rows and is about to produce rows. A partial state indicates that a slave is finished with the range of rowids, and needs another range of rowids to process additional rows. Partial state is the mechanism by which slave processes indicate to the QC that they need another rowid range to scan. Done indicates that a slave is finished processing.

Some states are optional. The need for a given state is dependent on where the DFO is positioned in the DFO tree, and the structure of the DFO. All DFOs except the DFO at the top of the DFO tree must indicate when they are ready. Every DFO except the leaves of the DFO tree must indicate when they have started. A DFO that is a producer of rows reaches the ready state. Only table scan DFOs reach the partial state. A DFO that consumes the output of another DFO reaches the started state. Child DFOs that have a parent reach the done state.

EXAMPLE

Referring to FIG. 3C, each dataflow scheduler starts executing the deepest, leftmost leaf in the DFO tree. Thus, the employee scan DFO directs its underlying nodes to produce rows. Eventually, the employee table scan DFO is told to begin execution. The employee table scan begins in the ready state because it is not consuming any rows. Each table scan slave DFO SQL statement, when parsed, generates a table scan row source in each slave. When executed, the table scan row source proceeds to access the employee table scan in the DBMS (e.g., performs the underlying operations required by the DBMS to read rows from a table), gets a first row, and is ready to transmit the row to its output table queue. The slaves implementing the table scan reply to the data flow scheduler that they are ready. The data flow scheduler monitors the count to determine when all of the slaves implementing the table scan have reached the ready state.

At this point, the data flow scheduler determines whether the DFO that is currently being implemented is the first child of the parent DFO. If it is, the data flow scheduler sends an execute message to a second slave set to start the sort/merge join DFO (i.e., 324A-324C). The slaves executing the SMJ DFO (i.e., 324A-324C) will transmit a "started" message. When the data flow scheduler has received a "started" message from all of the SMJ slaves (i.e., "n" slaves [...] scan and SMJ slaves), the data flow scheduler sends a resume to the table scan slaves. When the table scan slaves receive the resume, they begin to produce rows.

During execution, the table scan slaves may send a partial message. A partial message means that a slave has reached the end of a rowid range, and needs another rowid range to scan another portion of the table. The data flow scheduler does not have to wait for the other table scan slaves to reach this state. The data flow scheduler determines whether any rowid ranges remain. If there are no remaining rowid ranges, the data flow scheduler sends a message to the table scan slave that sent the "partial" message that it is finished. If there are more rowid ranges, the data flow scheduler sends the largest remaining rowid range to the table scan slave.

When each of the table scan slaves finishes its portion of the scan, it sends an "end of fetch" ("eof") message to the slaves that are executing the SMJ DFO via the table queue. When the SMJ DFO receives the "eof" messages from all of the table scan slaves, the SMJ DFO will report to the data flow scheduler that all of the table scan slaves are done. Once it is determined that all of the employee table scan has been completed, the data flow scheduler determines the next DFO to be executed.

The next DFO, the department table scan, is started. The same slave set is used to scan both the employee table and the department table. The department table scan slave DFOs (i.e., 344A-344C) will reach the ready state in the same way that the employee table scan reached ready. At that point, the data flow scheduler must determine whether the department table scan is the first child of its parent.

In this case, the department table scan DFO is not (i.e., the employee table scan DFO was the first child of the parent of the department table scan). Therefore, the parent DFO has already been started, and is ready to consume the rows produced by the department table scan slaves. Therefore, the data flow scheduler sends a "resume" to the department table scan slaves. The department table scan slaves will execute the department table scan, sending "partial" messages, if applicable.
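The handling of "partial" messages described in the example can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `RowidRange` and `RangeDispenser`, the integer encoding of rowids, and the tuple-shaped replies are all assumptions made for the sketch. It shows only the rule stated in the text, namely that the scheduler hands out the largest remaining rowid range and tells the slave it is finished when no ranges remain.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RowidRange:
    """Half-open range [start, end) of row identifiers (hypothetical encoding)."""
    start: int
    end: int

    def size(self) -> int:
        return self.end - self.start


@dataclass
class RangeDispenser:
    """Scheduler-side pool of rowid ranges not yet handed to any slave."""
    remaining: list = field(default_factory=list)

    def on_partial(self, slave_id: int):
        """Answer one "partial" message from a table scan slave.

        The scheduler does not wait for the other slaves: each request is
        answered as soon as it arrives.
        """
        if not self.remaining:
            # No rowid ranges remain: tell the requesting slave it is finished.
            return ("finished", None)
        # Otherwise send the largest remaining rowid range.
        largest = max(self.remaining, key=RowidRange.size)
        self.remaining.remove(largest)
        return ("range", largest)


dispenser = RangeDispenser(
    [RowidRange(0, 100), RowidRange(100, 400), RowidRange(400, 450)]
)
first = dispenser.on_partial(slave_id=1)   # the 300-row range goes out first
second = dispenser.on_partial(slave_id=2)  # then the 100-row range
```

Handing out the largest range first is the policy the text states; the sketch does not attempt to model how the ranges themselves are derived from the table.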
Once an "eof" message is received from all of the slaves implementing the department table scan, the SMJ DFO slaves can consume all of their inputs from the employee and department table scans, and will become ready to produce a row. At this point, the SMJ DFO slaves can transmit a "ready" message to the data flow scheduler.

Once the data flow scheduler receives a "ready" message from all of the slaves (i.e., count is equal to the number of slaves implementing the SMJ DFO), the data flow scheduler must determine whether the SMJ DFO has a parent. If so, the data flow scheduler must determine whether the SMJ DFO is the first child of its parent. If it is, the data flow scheduler must send an "execute" message to the slaves implementing the OrderBy DFO. In this case, the SMJ DFO is the first child of the OrderBy DFO (i.e., 322A-322C). Therefore, the data flow scheduler starts the OrderBy DFO. Because the set of slaves that implemented the table scans is done, the OrderBy DFO can be implemented by the same set of slaves that implemented the table scan DFOs.

Once the OrderBy DFO has started, it sends a "started" message to the data flow scheduler. When the data flow scheduler has received "started" messages from all of the OrderBy DFO slaves, it can send a "resume" message to the SMJ DFO slaves. The SMJ DFO begins to produce rows for consumption by the OrderBy slaves. As each SMJ DFO slave finishes, it sends an "eof" message to the OrderBy DFO. Once the OrderBy DFO receives an "eof" from all of the SMJ DFO slaves, the OrderBy DFO sends a message to the data flow scheduler. Because the OrderBy DFO is at the top of the tree, it does not have to go through any other states. Therefore, it can continue to output rows.

Fetch Operation

When a data flow scheduler receives a request for one or more rows, it executes its fetch operation. FIG. 11A illustrates a process flow for Fetch. At decision block 1102 (i.e., "current node not parallelized?"), if the current node is not parallelized, the row source operation is executed serially to satisfy the fetch request at block 1104. The data flow scheduler's fetch operation ends at block 1118.

If, at decision block 1102, it is determined that the current node is parallelized, processing continues at decision block 1106. At decision block 1106 (i.e., "does requester still want rows?"), if the requester no longer wants rows, processing ends at block 1118. If the requester still wants rows, processing continues at block 1110. At block 1110, the data flow scheduler waits for some output from the slaves processing the current node.

At decision block 1112 (i.e., "received some output from a slave?"), if one or more rows are output from the slaves, processing continues at processing block 1116 to invoke ProcessRowOutput. If, at decision block 1112, the output is message output, processing continues at block 1114 to invoke ProcessMsgOutput. In either case, after the output is addressed, processing continues at decision block 1106 to determine if more rows are requested by the requester.

ProcessRowOutput

When the data flow scheduler determines that slaves have generated rows (e.g., output rows to a TQ), the data flow scheduler monitors the output using ProcessRowOutput. FIG. 11B provides an example of the process flow of ProcessRowOutput. At block 1132, the output is accessed in the output TQ. At decision block 1134 (i.e., ""eof" pulled from TQ?"), if the TQ output is an end of fetch, the data flow scheduler marks all slaves as being finished, and stops the slaves at processing block 1136, and processing returns to Fetch at block 1144. If the output is not an "eof," processing continues at decision block 1138.

At decision block 1138 (i.e., "callback procedure supplied?"), if the requester supplied a callback routine to be used when rows have been produced, the data flow scheduler executes the callback routine, and processing returns to Fetch at block 1144. If there is no callback routine, processing continues at processing block 1142 to decrement the number of rows to be supplied, and the number of rows supplied. Processing returns to Fetch at block 1144.

ProcessMsgOutput

The slaves executing the operations synchronized by the dataflow scheduler send messages to the data flow scheduler to request additional direction, or to communicate their states. When the data flow scheduler receives these messages, it processes them using ProcessMsgOutput. FIGS. 11C and 11D illustrate a process flow of ProcessMsgOutput. At decision block 1162 (i.e., "Message='Started'?"), if the message received from a slave is "Started," processing continues at decision block 1164. If, at decision block 1164 (i.e., "all slaves started?"), the data flow scheduler has not received the "Started" message from all of the slaves, processing returns to Fetch at 1188.

If the data flow scheduler has received the "Started" message from all of the slaves, processing continues at block 1166. At processing block 1166, the slaves' next state becomes "Ready," and the data flow scheduler specifies that none of the slaves have reached that state. After each slave has sent a "Started" message to the data flow scheduler, it waits for a "Resume" message in return. At processing block 1168, the data flow scheduler sends a resume to the slaves, and processing returns to Fetch at block 1188.

If, at decision block 1162, the output was not a start message, processing continues at decision block 1170. At decision block 1170 (i.e., "Message='Ready'?"), if the output is a ready message, processing continues at block 1172 to invoke ProcessReadyMsg. After the ready message is processed by ProcessReadyMsg, processing returns to Fetch at block 1188.

If, at decision block 1170, the output was not a ready message, processing continues at decision block 1174. At decision block 1174 (i.e., "Message='Partial'?"), if the message was a "Partial," the slave has completed processing a table scan using a range, and is requesting a second range designation to continue scanning the table. At processing block 1176, the data flow scheduler sends a remaining range specification (if any) to the slave, and processing returns to Fetch at block 1188.

If, at decision block 1174, the message was not a partial message, processing continues at decision block 1178. At decision block 1178 (i.e., "Message='Done'?"), if the message is not a done message, processing returns to Fetch at 1188. If the message was a done message, processing continues at block 1180 to get the next DFO to be executed. At processing block 1182, the bit vector is modified to record which consumers of the rows received rows from the finished slaves.

At decision block 1184 (i.e., "all slaves done and some DFO is started or started DFO is next of next's parent?"), if so, processing continues at block 1186 to invoke NextDFO to begin the next DFO, and processing returns to Fetch at block 1188. If all of the slaves are not done or the started DFO is not ready, processing waits until the started DFO becomes ready, and returns to Fetch at block 1188.
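The message dispatch performed by ProcessMsgOutput can be sketched as a small state tracker. This is an illustrative simplification under stated assumptions: the class name `SchedulerSketch`, the `Msg` enumeration, and the string-valued `actions` log are hypothetical, the "Done" branch counts slaves instead of reproducing the full test at decision block 1184, and the ready branch only records that ProcessReadyMsg would be invoked.

```python
from enum import Enum


class Msg(Enum):
    STARTED = "Started"
    READY = "Ready"
    PARTIAL = "Partial"
    DONE = "Done"


class SchedulerSketch:
    """Tracks one state and one counter for a slave set, as the text describes."""

    def __init__(self, slave_count, ranges):
        self.slave_count = slave_count
        self.ranges = list(ranges)      # rowid ranges still to be scanned
        self.next_state = Msg.STARTED   # state the slaves are expected to reach
        self.count = 0                  # slaves that have reached that state
        self.actions = []               # what the scheduler would send or invoke

    def process_msg_output(self, slave_id, msg):
        if msg is Msg.STARTED:
            self.count += 1
            if self.count < self.slave_count:
                return                  # back to Fetch; wait for the other slaves
            # All slaves started: expect "Ready" next, reset the counter, resume.
            self.next_state = Msg.READY
            self.count = 0
            self.actions.append("resume -> all slaves")
        elif msg is Msg.READY:
            self.actions.append(f"ProcessReadyMsg(slave {slave_id})")
        elif msg is Msg.PARTIAL:
            if self.ranges:
                self.actions.append(f"range {self.ranges.pop(0)} -> slave {slave_id}")
            else:
                self.actions.append(f"finished -> slave {slave_id}")
        elif msg is Msg.DONE:
            # Simplified: invoke NextDFO once every slave has reported done.
            self.count += 1
            if self.count == self.slave_count:
                self.actions.append("NextDFO")


sched = SchedulerSketch(slave_count=2, ranges=[(0, 10)])
sched.process_msg_output(1, Msg.STARTED)
sched.process_msg_output(2, Msg.STARTED)
```

Keeping a single expected state and a single counter per slave set mirrors the description that the scheduler tracks the number of slaves that have reached a given state and the state itself.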
Resume

When a slave reports a ready for the current DFO, or a slave reports a started for the parent of the current DFO to the data flow scheduler, the data flow scheduler responds to the slave with a resume message to allow the slave to continue processing. FIG. 12 illustrates a Resume process flow. At block 1202, the TQ ID for output, the TQ partitioning type, a node identifier, and the range partitioning keys are obtained. At decision block 1204 (i.e., "node executed by QC?"), if the node is being serially executed, processing continues at block 1206. At block 1206, the process implementing the node (e.g., QC, data flow scheduler) empties the entire row source into the appropriate TQ, and Resume ends at block 1212.

If, at decision block 1204, the node is parallelized, processing continues at block 1208 to send a resume message to all of the slaves executing the current node. The next state for the slaves is marked as DONE and the count of the number of slaves that have reached that state is set to zero at processing block 1210. Resume ends at block 1212.

ProcessReadyMsg

When a producer slave is about to produce rows, the producer slave sends a "Ready" message to the data flow scheduler. When a ready message is received by the data flow scheduler, the data flow scheduler processes the ready message using ProcessReadyMsg. FIG. 13 illustrates a process flow for ProcessReadyMsg. At decision block 1302 (i.e., "all slaves ready?"), if all of the slaves are not ready, processing returns to Fetch at 1318 to wait until all of the slaves reach the ready state.

If, at decision block 1302, it is determined that all of the slaves have reached ready (i.e., count is equal to the number of slaves), processing continues at processing block 1304. At block 1304, no DFO started is indicated. At decision block 1306 (i.e., "parent of current ready?"), if the parent of the current node is ready to receive the rows produced by the slaves implementing the current node, processing continues at decision block 1308.

At decision block 1308 (i.e., "is the current done?"), if the slaves executing the current DFO have not reached the done state, processing returns to Fetch to wait for them to complete. If the slaves have reached the done state, NextDFO is invoked to implement the next node after the current DFO, and processing returns to Fetch at block 1318.

If, at decision block 1306 (i.e., "parent of current ready?"), the parent of the current is not ready, processing continues at 1310 to identify the parent of the current DFO. At decision block 1312 (i.e., "child first child of parent?"), if the current node has a parent and the current node is the first child of the parent to be executed, Start is invoked at block 1316 to start the parent. If the child is not the first child of the parent, the parent has already been started. Therefore, at block 1314, Resume is invoked to allow the parent to continue processing (e.g., consume the rows produced by the child). In either case, processing returns to Fetch at block 1318.

NextDFO

After the most recently processed DFO reaches the done state, it is necessary to determine the next DFO to be executed. The pointers that implement the structure of the row source and DFO trees are used to identify the next DFO to be executed.

Generally, the row source tree is left deep. A row source tree is left deep, if any row source subtree is the subtree of only the left input to its enclosing row source subtree. However, it is possible for a row source tree to be right deep. When a done reply is received from all of the slaves, it is necessary to determine when to execute the next DFO. In a [...] in FIG. 5, the next DFO to execute after execution of current DFO 502 is DFO [...] rightmost child of parent DFO 506. That is, next DFO 504 [...] (i.e., resume parent DFO 506) after receiving the [...] wait until current DFO 502 is done, and parent DFO 506 has reached a stable, ready state. Once parent DFO 506 has [...] data flow scheduler is not a resume for parent DFO 506. Instead, the [...] to remember that parent [...] waiting for a [...] NextDFO.

FIGS. 14A and 14B provide a process flow for NextDFO. At processing block 1402, the current node, the next node in the execution chain, the state of the parent, and the number of slaves executing the parent that have reached that state are identified. At processing block 1406, the sorcerer's apprentice bit vector is used to execute or resume, if the next DFO is an apprentice (i.e., a DFO that needs to examine the sorcerer's apprentice bit vector) to the current DFO.

At decision block 1408 (i.e., "is next a sibling of current?"), if the next DFO to be implemented is a sibling of the current DFO, processing continues at decision block 1412. If, at decision block 1408, the next DFO is not a sibling of the current DFO, the slave count for the parent is set to zero, and the parent's state is set to NULL at block 1410. Processing continues at decision block 1412.

At decision block 1412 (i.e., "does the next node have a child?"), if the next node does not have a child, the current DFO's state is set to NULL, and the number of slaves that have reached that state is set to zero at processing block 1414. At processing block 1416, Start is invoked to start the next DFO. The next DFO is set to the current DFO at processing block 1433, and processing returns at 1434.

If, at decision block 1412, the next node does have a child, processing continues at block 1418. At block 1418, parent is set to the parent of the next node. At decision block 1420 (i.e., "is next current's parent?"), if the next node is not the current's parent, the count is set to the number of slaves executing the current node, and the state is set to the ready state. Processing continues at decision block 1426.

If, at decision block 1420, it is determined that next is current's parent, processing continues at block 1424 to set the state of the current node to the state of its parent, and to set the count for the number of slaves that have reached that state to the number of slaves implementing the parent that have reached that state. Processing continues at decision block 1426.

At decision block 1426 (i.e., "have all current's slaves reached the ready state?"), if all of the slaves implementing the current node have not reached ready, the next DFO is set to the current DFO at processing block 1433, and processing returns at block 1434. If all of the slaves are ready, processing continues at decision block 1428. At decision block 1428 (i.e., "does next have a parent and is next the first child of the parent?"), if next is the first child of its parent, Start is invoked at block 1432 to start the parent. If next is not the first child of its parent, Resume is invoked at block 1430 to
resume the parent. In either case, the next DFO is set to the current DFO at block 1433, and processing returns at block 1434.

Close Operation

The close operation terminates the query slaves. Close can occur when the entire row source tree has been implemented, or at the end of a DFO tree. Initially, the parallelizer sends a stop message to each of the slaves running DFOs in the parallelizer's DFO tree to tell each of the slaves to stop processing. This triggers the slaves to perform any clean up operations (e.g., release any locks on data or resources) and to reach a state for termination. In addition, the close operation remits the slaves to the free pool.

FIG. 16 illustrates a process flow for Close. At decision block 1601 (i.e., ""Close" message expected by slaves?"), if a close message is expected by the slaves, SendCloseMsg is invoked at block 1604. Stop is invoked at block 1606. Flags are cleared at block 1608, and processing ends at block 1610.

FIG. 17 illustrates a process flow for SendCloseMsg. At block 1702, DFO is set to the first executed DFO. At decision block 1704 (i.e., "no current DFO or current DFO not parallel?"), if there is no current DFO or the current DFO is not parallel, processing ends at block 1714. If not, processing continues at decision block 1706.

At decision block 1706 (i.e., "DFO found?"), if a DFO is not found, processing ends at block 1714. If a DFO is found, processing continues at decision block 1708. At decision block 1708 (i.e., "DFO slaves expecting close message?"), if the DFO is expecting a close message, processing continues at block 1710 to send a close message to each of the slaves in the set, and processing continues at decision block 1716. If the DFO is not expecting a close message, processing continues at decision block 1716.

At decision block 1716 (i.e., "DFO=current DFO?"), if the DFO is the current DFO, processing ends at block 1714. If it is not the current DFO, then processing continues at block 1716 to get the next DFO, and processing continues at decision block 1706 to process any remaining DFOs.

FIG. 19 illustrates a Stop process flow. At decision block 1902 (i.e., "Serial process?"), if the process is a serial process, processing continues at block 1904 to close the underlying row source, and processing ends at block 1610. If the process is not a serial process, processing continues at block 1906. At block 1906, the slaves are closed, and deleted, if necessary. At block 1908, current DFO and current output TQ are cleared. Processing ends at block 1610.

Row Operator

The present invention provides the ability to pass a routine from a calling row source to an underlying row source. The routine can be used by the underlying row source to perform a function for the calling row source. For example, a calling row source can call an underlying row source and pass a routine to the underlying row source to place the row sources in a location for the calling row source. Once the underlying routine has produced the rows, the underlying row source can use the callback routine to place the row sources in a data store location (e.g., database or table queue).

SLAVE PROCESSES

A slave DFO receives execution messages from the dataflow scheduler. For example, a slave DFO may receive a message to parse DFO SQL statements, resume operation, execute a DFO, or close. When a message is received by a slave DFO, the slave DFO must determine the meaning of the message and process the message. FIG. 7A illustrates a process flow for receipt of execution messages.

At block 702, an execution message from the QC is read. At decision block 704 (i.e., "message is 'parse'?"), if the execution message is a parse message, processing continues at block 706 to invoke SlaveParse, and processing continues at block 702 to process execution messages sent by the QC. If the execution message is not a parse message, processing continues at decision block 708. At decision block 708 (i.e., "message is 'execute'?"), if the execution message is an execute message, processing continues at block 710 to invoke SlaveExecute, and processing continues at block 702 to process execution messages.

If, at decision block 708, the execution message is not an execute message, processing continues at decision block 712. At decision block 712 (i.e., "message is 'resume'?"), if the execution message is a resume message, processing continues at block 714 to invoke SlaveResume, and processing continues at block 702 to process execution messages. If the message is not a resume message, processing continues at decision block 716. At decision block 716 (i.e., "message is 'close'?"), if the execution message is a close message, processing continues at block 718 to invoke SlaveClose. If the message is not a close message, processing continues at decision block 702 to process execution messages.

SlaveParse

A parse execution message is sent after it is determined that the DFO SQL statements must be parsed before execution. FIG. 7B illustrates a process flow for a slave DFO processing a parse message. At block 720, a database cursor is opened for each DFO. At block 722, each DFO SQL statement is parsed. Processing block 724 binds all SQL statement inputs and defines all output values. At processing block 726, the parsed cursor numbers are returned to the QC, and the SlaveParse process ends.

SlaveExecute

If an execute message is received from the QC, the slave DFO receiving the message must execute the DFO. FIG. 7C illustrates a process flow for executing a DFO. At decision block 730 (i.e., "first execute of this DFO?"), if this is not the first execution message received for this DFO, processing continues at block 746 to invoke SlaveFetch to fetch all rows, and processing ends at block 748.

If this is the first execution message received, processing continues at decision block 732 (i.e., "QC expects 'started'?") to determine whether the QC expects a reply indicating that the slave has started. If yes, processing continues at block 734 to send a "started" message to the QC, and processing continues at block 736. If not, processing continues at block 736.

Block 736 processes bind variables, and executes the cursor. At block 738, "done" replies are sent to the QC for all of the child DFOs of the DFO being executed. At decision block 740 (i.e., "QC expects 'ready' replies?"), if the QC expects a ready message to indicate that the slave DFO is ready to fetch rows, processing continues at block 742. At block 742, one row is fetched from the DFO cursor. Processing continues at block 744 to send a "ready" reply to the QC, and processing ends. If the QC does not expect a ready message, processing continues at block 746 to fetch all rows from the DFO cursor, and processing ends at block 748.
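The SlaveExecute flow of FIG. 7C can be traced with a short sketch. It is only an illustration under assumptions: the function name `slave_execute`, its parameters, and the tuple-based action list are hypothetical stand-ins for the cursor operations and QC messages, which the text does not specify at this level of detail.

```python
def slave_execute(first_execute, qc_expects_started, qc_expects_ready,
                  child_dfos, cursor_rows):
    """Return the ordered actions a slave takes for one execute message."""
    actions = []
    if not first_execute:
        # Block 746: a repeated execute message simply fetches all rows.
        actions.append(("fetch_all", list(cursor_rows)))
        return actions
    if qc_expects_started:
        # Block 734: tell the QC that this slave has started.
        actions.append(("send", "started"))
    # Block 736: process bind variables and execute the cursor.
    actions.append(("execute_cursor",))
    # Block 738: report "done" for every child DFO of the executed DFO.
    for child in child_dfos:
        actions.append(("send", f"done:{child}"))
    if qc_expects_ready:
        # Blocks 742 and 744: fetch a single row, then report "ready".
        actions.append(("fetch_one", next(iter(cursor_rows), None)))
        actions.append(("send", "ready"))
    else:
        # Block 746: no ready reply expected, so fetch everything.
        actions.append(("fetch_all", list(cursor_rows)))
    return actions


trace = slave_execute(
    first_execute=True,
    qc_expects_started=True,
    qc_expects_ready=True,
    child_dfos=["scan_emp"],
    cursor_rows=[("smith", 10), ("jones", 20)],
)
```

Returning a list of actions instead of performing I/O keeps the sketch self-contained and makes the ordering of replies easy to inspect.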



